This is an R markdown document to accompany the main blog post. It downloads all the data and generates all the figures for the blog (except for results drawn from other papers).

Dependencies

options(stringsAsFactors = FALSE)
library(Matrix)
library(DelayedArray)
library(Seurat)
library(ggplot2)
library(irlba)
library(patchwork)
library(plyr)
library(dplyr)
library(stringr)
library(SnapATAC)
library(GenomicRanges)
library(cisTopic)

The libraries loaded above must all be installed. Note that several PCA steps are involved that can be much faster depending on which version of BLAS your R installation is linked against. For example, I am running this on my Mac and it uses the Accelerate framework version of BLAS, which is quite fast (and if you are on a Mac, you will probably find the same). This can be variable across platforms, so I just want to mention it as it can result in fairly large speed differences. Speed may also be vary by the size of the dataset and this is not something I have examined in too much detail yet.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin17.6.0 (64-bit)
## Running under: macOS  10.14
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cisTopic_0.2.1       GenomicRanges_1.34.0 GenomeInfoDb_1.16.0 
##  [4] SnapATAC_1.0.0       rhdf5_2.24.0         stringr_1.3.1       
##  [7] dplyr_0.7.8          plyr_1.8.4           patchwork_0.0.1     
## [10] irlba_2.3.3          ggplot2_3.1.1        Seurat_3.0.0.9000   
## [13] DelayedArray_0.6.6   BiocParallel_1.14.2  IRanges_2.16.0      
## [16] S4Vectors_0.20.1     BiocGenerics_0.26.0  matrixStats_0.54.0  
## [19] Matrix_1.2-14       
## 
## loaded via a namespace (and not attached):
##   [1] snow_0.4-2                  backports_1.1.2            
##   [3] igraph_1.2.4.1              lazyeval_0.2.2             
##   [5] sp_1.3-1                    GSEABase_1.42.0            
##   [7] splines_3.5.1               listenv_0.7.0              
##   [9] feather_0.3.3               digest_0.6.18              
##  [11] foreach_1.4.4               htmltools_0.3.6            
##  [13] gdata_2.18.0                lda_1.4.2                  
##  [15] magrittr_1.5                memoise_1.1.0              
##  [17] cluster_2.0.7-1             doParallel_1.0.14          
##  [19] ROCR_1.0-7                  limma_3.36.5               
##  [21] globals_0.12.4              annotate_1.58.0            
##  [23] R.utils_2.8.0               colorspace_1.3-2           
##  [25] blob_1.1.1                  ggrepel_0.8.0              
##  [27] crayon_1.3.4                RCurl_1.95-4.11            
##  [29] jsonlite_1.6                bigmemory.sri_0.1.3        
##  [31] graph_1.58.2                bindr_0.1.1                
##  [33] survival_2.42-6             zoo_1.8-3                  
##  [35] iterators_1.0.10            glue_1.3.0                 
##  [37] gtable_0.3.0                zlibbioc_1.26.0            
##  [39] XVector_0.22.0              Rhdf5lib_1.2.1             
##  [41] future.apply_1.0.1          scales_1.0.0               
##  [43] DBI_1.0.0                   edgeR_3.22.5               
##  [45] bibtex_0.4.2                Rcpp_1.0.1                 
##  [47] metap_1.0                   viridisLite_0.3.0          
##  [49] xtable_1.8-3                reticulate_1.10            
##  [51] bit_1.1-14                  rsvd_1.0.0                 
##  [53] SDMTools_1.1-221            tsne_0.1-3                 
##  [55] htmlwidgets_1.3             httr_1.3.1                 
##  [57] gplots_3.0.1                RColorBrewer_1.1-2         
##  [59] ica_1.0-2                   pkgconfig_2.0.2            
##  [61] XML_3.98-1.16               R.methodsS3_1.7.1          
##  [63] locfit_1.5-9.1              tidyselect_0.2.5           
##  [65] rlang_0.3.4                 later_0.8.0                
##  [67] AnnotationDbi_1.44.0        munsell_0.5.0              
##  [69] tools_3.5.1                 RSQLite_2.1.1              
##  [71] ggridges_0.5.1              evaluate_0.11              
##  [73] yaml_2.2.0                  npsurv_0.4-0               
##  [75] knitr_1.20                  bit64_0.9-7                
##  [77] fitdistrplus_1.0-14         caTools_1.17.1.1           
##  [79] purrr_0.2.5                 RANN_2.6.1                 
##  [81] bindrcpp_0.2.2              pbapply_1.3-4              
##  [83] future_1.10.0               mime_0.6                   
##  [85] R.oo_1.22.0                 compiler_3.5.1             
##  [87] plotly_4.8.0                png_0.1-7                  
##  [89] lsei_1.2-0                  tibble_2.1.1               
##  [91] stringi_1.2.4               lattice_0.20-35            
##  [93] pillar_1.3.1                Rdpack_0.8-0               
##  [95] lmtest_0.9-36               data.table_1.12.2          
##  [97] cowplot_0.9.3               bitops_1.0-6               
##  [99] bigmemory_4.5.33            gbRd_0.4-11                
## [101] raster_2.8-19               httpuv_1.5.1               
## [103] AUCell_1.5.5                R6_2.4.0                   
## [105] promises_1.0.1              KernSmooth_2.23-15         
## [107] RcisTarget_1.3.5            codetools_0.2-15           
## [109] MASS_7.3-50                 gtools_3.8.1               
## [111] assertthat_0.2.1            SummarizedExperiment_1.10.1
## [113] rprojroot_1.3-2             withr_2.1.2                
## [115] GenomeInfoDbData_1.1.0      doSNOW_1.0.16              
## [117] hms_0.4.2                   grid_3.5.1                 
## [119] tidyr_0.8.1                 rmarkdown_1.10             
## [121] Rtsne_0.15                  Biobase_2.40.0             
## [123] shiny_1.3.1

Setting up utility functions

There are a few utility functions that we’ll need later on. Note that the TF-IDF function is longer than it really needs to be as it provides several options and also has an out of memory matrix multiplication implementation that it falls back on.

Note that I use the lsi_workflow function with log_scale_tf=TRUE for the LSI logTF discussed in the blog post. This also seems to work pretty much same as the version of LSI that 10x uses, which I have tried to replicate in the tenx_lsi_workflow function. Both are maintained here in case you want to compare / verify.

########################################
# Utility functions
########################################
load_tenx_atac = function(matrix_fn, peaks_fn, barcodes_fn) {
  atac_matrix = readMM(matrix_fn)
  colnames(atac_matrix) = read.delim(barcodes_fn, header=FALSE)$V1
  peaks = read.delim(peaks_fn, header=FALSE)
  peaks = paste(peaks$V1, peaks$V2, peaks$V3, sep = '_')
  rownames(atac_matrix) = peaks
  return(atac_matrix)
}

filter_features = function(count_matrix, cells=10) {
  count_matrix = count_matrix[Matrix::rowSums(count_matrix) >= cells, ]
  return(count_matrix)
}

filter_cells = function(count_matrix, features_above_threshold=100, threshold=0) {
  count_matrix = count_matrix[, Matrix::colSums(count_matrix > threshold) >= features_above_threshold]
  return(count_matrix)
}

downsample_matrix = function(bmat, fraction_remaining=0.5, cells_per_site_min=1, sites_per_cell_min=1) {
  set.seed(2019)
  non_zero_entries = which(bmat@x > 0)
  indices_to_zero = sample(non_zero_entries, size=ceiling(length(non_zero_entries) * (1 - fraction_remaining)))
  bmat@x[indices_to_zero] = 0
  
  # Make sure to get rid of stuff that has gone to ~ 0 after downsampling
  bmat = filter_features(bmat, cells=cells_per_site_min)
  bmat = filter_cells(bmat, features_above_threshold=sites_per_cell_min)
  return(bmat)
}

########################################
# Functions for LSI
########################################
safe_tfidf_multiply = function(tf, idf,  block_size=2000e6) {
  result = tryCatch({
    result = tf * idf
    result
  }, error = function(e) {
    options(DelayedArray.block.size=block_size)
    DelayedArray:::set_verbose_block_processing(TRUE)
    
    tf = DelayedArray(tf)
    idf = as.matrix(idf)
    
    result = tf * idf
    result
  })
  result = as(result, "sparseMatrix")
  return(result)
}

tfidf = function(count_matrix, frequencies=TRUE, log_scale_tf=TRUE, scale_factor=100000) {
  # Use either raw counts or divide by total counts in each cell
  if (frequencies) {
    # "term frequency" method
    tf = t(t(count_matrix) / Matrix::colSums(count_matrix))
  } else {
    # "raw count" method
    tf = count_matrix
  }
  
  # Either TF method can optionally be log scaled
  if (log_scale_tf) {
    if (frequencies) {
      tf@x = log1p(tf@x * scale_factor)
    } else {
      tf@x = log1p(tf@x * 1)
    }
  }
  
  # IDF w/ "inverse document frequency smooth" method
  idf = log(1 + ncol(count_matrix) / Matrix::rowSums(count_matrix))
  
  # Try to just to the multiplication and fall back on delayed array
  tf_idf_counts = safe_tfidf_multiply(tf, idf)
  rownames(tf_idf_counts) = rownames(count_matrix)
  colnames(tf_idf_counts) = colnames(count_matrix)
  return(tf_idf_counts)
}

tenx_tfidf = function(count_matrix) {
  idf = log(ncol(count_matrix) + 1) - log(1 + Matrix::rowSums(count_matrix))
  tf_idf_counts = safe_tfidf_multiply(count_matrix, idf)
  
  rownames(tf_idf_counts) = rownames(count_matrix)
  colnames(tf_idf_counts) = colnames(count_matrix)
  tf_idf_counts = as(tf_idf_counts, "sparseMatrix")
  return(tf_idf_counts)
}

do_pca = function(tf_idf_counts, dims=50) {
  pca.results = irlba(t(tf_idf_counts), nv=dims)
  final_result = pca.results$u %*% diag(pca.results$d)
  rownames(final_result) = colnames(tf_idf_counts)
  colnames(final_result) = paste0('PC_', 1:dims)
  return(final_result)
}

########################################
# Helper functions for dim reduction
########################################
run_dim_reduction = function(atac_matrix, cell_embeddings, dims, metadata=NULL, reduction='pca.l2') {
  if (is.null(metadata)) {
    seurat_obj = Seurat::CreateSeuratObject(atac_matrix)
  } else {
    seurat_obj = Seurat::CreateSeuratObject(atac_matrix, meta.data = metadata)
  }

  seurat_obj[['pca']] = Seurat::CreateDimReducObject(embeddings=cell_embeddings, key='PC_', assay='RNA')
  seurat_obj = seurat_obj %>%
                Seurat::L2Dim(reduction='pca') %>%
                Seurat::RunUMAP(reduction = reduction, dims = dims) %>%
                Seurat::RunTSNE(reduction = reduction, dims = dims) %>%
                Seurat::FindNeighbors(reduction=reduction, nn.eps=0.25, dims=dims)
  return(seurat_obj)
}

plot_pc_correlation = function(seurat_obj, reduction, column='nCount_RNA') {
  coords = Seurat::Embeddings(seurat_obj, reduction=reduction)
  column_value = seurat_obj@meta.data[, column]
  correlations = apply(coords, 2, function(x) {cor(x, column_value, method='spearman')})
  correlations_df = data.frame(correlation=correlations, PC=1:ncol(coords))
  
  plot_obj = ggplot(correlations_df, aes(PC, correlation)) +
    geom_point() +
    theme_classic() +
    geom_hline(yintercept = 0, linetype='dashed', color='red')
    
  return(plot_obj)
}

########################################
# Wrapper functions for workflows
########################################
lsi_workflow = function(bmat, dims, metadata=NULL, log_scale_tf=TRUE, reduction='pca.l2', resolution=0.3) {
  tfidf_mat = tfidf(bmat, log_scale_tf=log_scale_tf)
  pca_mat = do_pca(tfidf_mat, dims=max(dims))
  
  seurat_obj = run_dim_reduction(bmat, pca_mat, dims, metadata) %>%
                Seurat::FindClusters(reduction=reduction, n.start=20, resolution=resolution)
  return(seurat_obj)
}

tenx_lsi_workflow = function(bmat, dims, metadata=NULL, reduction='pca.l2', resolution=0.3) {
  tfidf_mat = tenx_tfidf(bmat)
  pca_mat = do_pca(tfidf_mat, dims=max(dims))
  
  seurat_obj = run_dim_reduction(bmat, pca_mat, dims, metadata) %>%
                Seurat::FindClusters(reduction=reduction, n.start=20, resolution=resolution)
  return(seurat_obj)
}

cistopic_workflow = function(bmat, topic=c(40, 50)) {
  coords = str_split_fixed(rownames(bmat), '_', 3)
  new_coords = paste0(coords[, 1], ':', coords[, 2], '-', coords[, 3])
  rownames(bmat) = new_coords

  cisTopicObject = cisTopic::createcisTopicObject(bmat, project.name='mouse_atac')
  cisTopicObject = cisTopic::runModels(cisTopicObject, topic, seed=2019, nCores=1, burnin = 250, iterations = 500)
  return(cisTopicObject)
}

# This uses single core because multithreaded implementation interferes with Knitr. In running this code, any do.par=FALSE and num.cores=1 could be changed as needed.
snapatac_workflow = function(snap_file, promoter.df=NULL, blacklist.df=NULL, fragment_number_threshold=500, promoter_ratio_range=c(0.2, 0.8), sample_name='default', pc.num=50, window_z_range=c(-1.5, 1.5)) {
  x.sp = createSnap(
    file=snap_file,
    sample=sample_name,
    do.par=FALSE,
    num.cores=1
  )
  
  plotBarcode(
    obj=x.sp, 
    pdf.file.name=NULL, 
    pdf.width=7, 
    pdf.height=7, 
    col="grey",
    border="grey",
    breaks=50
  )
  
  x.sp = filterCells(
    obj=x.sp, 
    subset.names=c("fragment.num", "UMI"),
    low.thresholds=c(fragment_number_threshold, fragment_number_threshold),
    high.thresholds=c(Inf, Inf)
  )
  
  x.sp = addBmatToSnap(x.sp, bin.size=5000, num.cores=1)

  # Optionally filter cells based on ratio of reads in promoters
  if (!is.null(promoter.df)) {
    promoter.gr = GRanges(promoter.df[,1], IRanges(promoter.df[,2], promoter.df[,3]))
    ov = findOverlaps(x.sp@feature, promoter.gr)
    idy = queryHits(ov)
    promoter_ratio = SnapATAC::rowSums(x.sp[,idy, mat="bmat"], mat="bmat") / SnapATAC::rowSums(x.sp, mat="bmat")
    plot(log(SnapATAC::rowSums(x.sp, mat="bmat") + 1,10), promoter_ratio, cex=0.5, col="grey", xlab="log(count)", ylab="FIP Ratio", ylim=c(0,1 ))
    idx = which(promoter_ratio > promoter_ratio_range[1] & promoter_ratio < promoter_ratio_range[2])
    x.sp = x.sp[idx,]
  }
  
  x.sp = makeBinary(x.sp, mat="bmat");
  
  # Filter out non-standard contigs if present
  idy2 = grep("chrM|random", x.sp@feature)
  
  if (!is.null(blacklist.df)) {
  black_list.gr = GRanges(
    blacklist.df[,1], 
    IRanges(blacklist.df[,2], blacklist.df[,3])
  )
    idy1 = queryHits(findOverlaps(x.sp@feature, black_list.gr))
    
  } else {
    # No blacklist provided, so just ignore
    idy1 = c()
  }
  
  idy = unique(c(idy1, idy2))
  x.sp = x.sp[,-idy, mat="bmat"]
  
  # Filter based on frequency
  x.sp = filterBins(
    x.sp,
    low.threshold=window_z_range[1],
    high.threshold=window_z_range[2],
    mat="bmat"
  )
  
  plotBinCoverage(
    x.sp,
    pdf.file.name=NULL,
    col="grey",
    border="grey",
    breaks=10,
    xlim=c(-6,6)
  )
  
  x.sp = runJaccard(
    obj = x.sp,
    tmp.folder=tempdir(),
    mat = "bmat",
    max.var=2000,
    ncell.chunk=1000,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  x.sp = runNormJaccard(
    obj = x.sp,
    tmp.folder=tempdir(),
    ncell.chunk=1000,
    method="normOVE",
    row.center=TRUE,
    row.scale=TRUE,
    low.threshold=-5,
    high.threshold=5,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  x.sp = runDimReduct(
    x.sp,
    pc.num=pc.num,
    input.mat="jmat",
    method="svd",
    center=TRUE,
    scale=FALSE,
    seed.use=10
  )

  rownames(x.sp@bmat) = x.sp@barcode
  colnames(x.sp@bmat) = as.character(1:ncol(x.sp@bmat))

  return(x.sp)
}

snapatac_rerun_jaccard = function(snap_obj, pc.num=50) {
  
  snap_obj = runJaccard(
    obj = snap_obj,
    tmp.folder=tempdir(),
    mat = "bmat",
    max.var=2000,
    ncell.chunk=1000,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  snap_obj = runNormJaccard(
    obj = snap_obj,
    tmp.folder=tempdir(),
    ncell.chunk=1000,
    method="normOVE",
    row.center=TRUE,
    row.scale=TRUE,
    low.threshold=-5,
    high.threshold=5,
    do.par=FALSE,
    num.cores=1,
    seed.use=10
  )

  snap_obj = runDimReduct(
    snap_obj,
    pc.num=pc.num,
    input.mat="jmat",
    method="svd",
    center=TRUE,
    scale=FALSE,
    seed.use=10
  )
}

seurat_workflow_on_cistopic = function(cistopicObject, resolution=0.3, method='Z-score', reduction='pca') {
  cistopicObject = cisTopic::selectModel(cistopicObject)

  cistopicObject.reduced_space = t(cisTopic::modelMatSelection(cistopicObject, target='cell', method=method))
  colnames(cistopicObject.reduced_space) = paste0('PC_', 1:ncol(cistopicObject.reduced_space))
  dimensions = ncol(cistopicObject.reduced_space)
  
  cistopicObject.seurat = run_dim_reduction(cistopicObject@binary.count.matrix, cistopicObject.reduced_space, dims=1:dimensions, reduction='pca')
  
  cistopicObject.seurat = cistopicObject.seurat %>% 
    Seurat::FindNeighbors(reduction=reduction, nn.eps=0.25, dims=1:dimensions) %>%
    Seurat::FindClusters(reduction=reduction, n.start=20, resolution=resolution)
  return(cistopicObject.seurat)
}

seurat_workflow_on_jaccard_pca = function(snap_obj, dims, resolution=0.3, reduction='pca') {
  pca_results.jaccard = snap_obj@smat@dmat %*% diag(snap_obj@smat@sdev)
  colnames(pca_results.jaccard) = paste0('PC_', 1:ncol(pca_results.jaccard))
  rownames(pca_results.jaccard) = snap_obj@barcode

  seurat_obj.jaccard = run_dim_reduction(t(snap_obj@bmat), pca_results.jaccard, dims, reduction=reduction)
  seurat_obj.jaccard = seurat_obj.jaccard %>%
    Seurat::FindNeighbors(nn.eps=0.25, dims=dims, reduction=reduction) %>%
    Seurat::FindClusters(n.start=20, resolution=resolution, dims=dims, reduction=reduction)
}

plot_clustering_comparison = function(seurat_obj1, seurat_obj2, reduction, description1='', description2 = '', cluster_column1='RNA_snn_res.0.3', cluster_column2='RNA_snn_res.0.3') {
  # Clusters as called on each dataset
  seurat_obj1 = SetIdent(seurat_obj1, value=cluster_column1)
  seurat_obj2 = SetIdent(seurat_obj2, value=cluster_column2)
  
  p1 = DimPlot(seurat_obj1, reduction = 'tsne', pt.size=0.25) +
            ggtitle(description1)
  
  p2 = DimPlot(seurat_obj2, reduction = 'tsne', pt.size=0.25) +
            ggtitle(description2)
  
  # Now swap the labels to verify they are finding the same groups
  seurat_obj1@meta.data$cluster_seurat_obj2 = seurat_obj2@meta.data[, cluster_column2]
  seurat_obj2@meta.data$cluster_seurat_obj1 = seurat_obj1@meta.data[, cluster_column1]
  
  seurat_obj1 = SetIdent(seurat_obj1, value='cluster_seurat_obj2')
  seurat_obj2 = SetIdent(seurat_obj2, value='cluster_seurat_obj1')
  
  p3 = DimPlot(seurat_obj1, reduction = reduction, pt.size=0.25) +
            ggtitle(paste0(description1, ' (', description2, ' clusters)'))
  
  p4 = DimPlot(seurat_obj2, reduction = reduction, pt.size=0.25) +
            ggtitle(paste0(description2, ' (', description1, ' clusters)'))
  
  (p1 + p3) / (p2 + p4)
}

Data

We’ll need some data files from our study Cusanovich, Hill, et al. [Cell 2018], a public dataset from 10X, and some other auxillary files. I’ve also processed some of these datasets with SnapTools and provide .snap files that allow for comparison of the use of peaks and windows as features.

dir.create('data_downloads', showWarnings = FALSE)

# Cusanovich and Hill et al.
download.file("http://krishna.gs.washington.edu/content/members/ajh24/mouse_atlas_data_release/matrices/atac_matrix.binary.qc_filtered.rds", "data_downloads/atac_matrix.binary.qc_filtered.rds")
download.file("http://krishna.gs.washington.edu/content/members/ajh24/mouse_atlas_data_release/metadata/cell_metadata.tissue_freq_filtered.txt", "data_downloads/cell_metadata.tissue_freq_filtered.txt")

# 10X Adult Mouse Brain
download.file("http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.tar.gz", "data_downloads/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.tar.gz")
untar("data_downloads/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.tar.gz", exdir = 'data_downloads/')

# CisTopic outputs for both datasets
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/WholeBrain_PFC.cistopic.rds', 'data_downloads/WholeBrain_PFC.cistopic.rds')
download.file('http://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/atac_v1_adult_brain_fresh_5k.cistopic.rds', destfile = 'data_downloads/atac_v1_adult_brain_fresh_5k.cistopic.rds')

# *.snap files produced by SnapTools for use with SnapATAC that I've made available
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/atac_v1_adult_brain_fresh_5k.snap', 'data_downloads/atac_v1_adult_brain_fresh_5k.snap')
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/WholeBrain_PFC.snap', 'data_downloads/WholeBrain_PFC.snap')

# Auxillary datafiles
## Promoter definitions
download.file('http://renlab.sdsc.edu/r3fang/share/Fang_2019/MOs_snATAC/genes/promoter.bed', 'data_downloads/promoters.mm10.bed')
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/promoters.mm9.bed', 'data_downloads/promoters.mm9.bed')
download.file('https://krishna.gs.washington.edu/content/members/ajh24/2019_04_19_atac_dim_reduction_blogpost/data_release/promoters.hg19.bed', 'data_downloads/promoters.hg19.bed') # not used here

## ENCODE blacklist regions
download.file('http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz', 'data_downloads/encode_blacklist.mm10.bed.gz')
download.file('http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm9-mouse/mm9-blacklist.bed.gz', 'data_downloads/encode_blacklist.mm9.bed.gz')
download.file('http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg19-human/wgEncodeHg19ConsensusSignalArtifactRegions.bed.gz', 'data_downloads/encode_blacklist.hg19.bed.gz')

Published flavor of LSI vs. using log-scaled tf matrix

As a first example, I’ll data from Whole Brain and PFC published as part of Cusanovich, Hill, et al. [Cell 2018]. I’m choosing this subset of the data because there is a recent preprint that uses a kmer-based method on this same subset (useful for comparison purposes). It’s worth noting that we only used a fairly carefully selected subset of all peaks for dimensionality reduction in the actual paper, which helped us quite a bit in that case. Here I am using a much simpler result to show that the version of LSI we used is fairly sensitive to the sites you use as input compared to the approach proposed here.

mouse.bmat = readRDS("data_downloads/atac_matrix.binary.qc_filtered.rds")
mouse.metadata = read.delim("data_downloads/cell_metadata.tissue_freq_filtered.txt", sep='\t')
rownames(mouse.metadata) = mouse.metadata$cell

selected_tissues = c("WholeBrain", "PreFrontalCortex") #, "Cerebellum")
selected_cells = subset(mouse.metadata, tissue %in% selected_tissues) %>% rownames()

mouse.metadata = mouse.metadata[selected_cells, ]
mouse.bmat = mouse.bmat[, selected_cells]

# Take top 100000 peaks by frequency.
# Some filtering is important here because the peaks in this study were a merged set of peaks so many would not ever be called 
# or perhaps even observed in more than a few cells within this particular subset (in contrast to datasets composed of individual samples).
# The particular number doesn't impact LSI results much, but smaller makes cisTopic a bit faster.
top_sites = rank(-Matrix::rowSums(mouse.bmat)) <= 100000
mouse.bmat = mouse.bmat[top_sites,]

Show the distribition of TF matrix entries when dividing by total counts. Note that this is not relevant for the 10x method, which just uses the matrix counts rather than dividing by total counts.

tf = t(t(mouse.bmat) / Matrix::colSums(mouse.bmat))
comparison_df = rbind(data.frame(value=tf@x, method='TF'), data.frame(value=log1p(tf@x * 100000), method='log TF'))

ggplot(comparison_df, aes(value)) +
  geom_histogram(bins = 70, aes(fill=method), color='black') +
  theme_classic() +
  scale_fill_manual(values=c("TF"="red", "log TF"="#d3d3d3")) +
  xlab('value in matrix') +
  facet_wrap(~method, scales='free')

Next, run LSI with and without log scaling the TF matrix in TF-IDF and compare, all with the same peaks/cells as input.

mouse.seurat.lsi = lsi_workflow(mouse.bmat, dims=2:50, metadata=mouse.metadata, log_scale_tf=FALSE, reduction='pca')

# note that pca.l2 here refers to an L2 normalized PCA space, which we find can sometimes (but not always) improve clustering results relative to using the raw PCA space. This is used throughout the post but doesn't seem to matter much for these datasets.
mouse.seurat.lsi_log = lsi_workflow(mouse.bmat, dims=2:50, metadata=mouse.metadata, log_scale_tf=TRUE, reduction='pca.l2')

Confirm that the first PC is highly correlated with depth in both cases (which is why we drop, I won’t show this in most cases moving forward, although you could also use something like limma to regress out log10(total counts)):

p1 = plot_pc_correlation(mouse.seurat.lsi, reduction = 'pca') +
      ggtitle('Typical LSI')

p2 = plot_pc_correlation(mouse.seurat.lsi_log, reduction = 'pca') +
      ggtitle('LSI With Log Scaled TF')

p1 + p2

Compare tSNE plots with original cluster annotations from each method and then applying cluster annotations derived from the method being compared to to evaluate concordance. Note that the primary “metrics” used to assess “better” performance or “equivalent” performance throughout this post are: 1) Clustering results at identical resolution with exactly the same sites and peaks used as input. 2) Concordance of dimensionality reduction/clustering visually between two or more methods.

While these are not perfect metrics, the differences that I care about here are large enough that I think they are sufficient for the claims I am trying to make. I am fairly confident in the claims given that very similar dimensionality reduction and clustering results are achieved across multiple different methods in many cases. Subtle differences would require more extensive comparison.

plot_clustering_comparison(mouse.seurat.lsi,
                           mouse.seurat.lsi_log,
                           reduction='tsne',
                           description1='Published LSI',
                           description2='LSI logTF',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')

At least from first glance, it seems like log scaling the TF matrix makes a big difference in this context. As noted in the blog post, both of these approaches appear qualitately much better than the results from a k-mer based approach.

Comparison to cisTopic

In order to confirm that the perceived improvement I was seeing here was reproducible with other tools (would make it less likely to be artifactual), I wanted to compare to cisTopic, an LDA-based approach. This takes quite a long time in my hands, so I am commenting the actual command out and just loading in precomputed results that were downloaded above. You can run the command yourself if you’d like. Note that in the blog post I also show a result from Lareau, Duarte, Chew, et al. that uses a kmer-based approach, which seems to perform worse than LSI or the other methods described here when used with this dataset.

Select the model from 40 or 50 topics (this is what I specified when running) and then run the usual dim reduction and clustering:

# mouse.cistopic = cistopic_workflow(mouse.bmat)
mouse.cistopic = readRDS('data_downloads/WholeBrain_PFC.cistopic.rds')
mouse.seurat.cistopic = seurat_workflow_on_cistopic(mouse.cistopic, resolution=0.3, method='Z-score')

It’s worth noting that the topics generated by LDA (cisTopic) don’t correlate with depth in the same way as with LSI above, which is why I didn’t drop any of them moving forward.

plot_pc_correlation(mouse.seurat.cistopic, reduction = 'pca')

Compare results as above:

plot_clustering_comparison(mouse.seurat.lsi_log,
                           mouse.seurat.cistopic,
                           reduction='tsne',
                           description1='LSI logTF',
                           description2='cisTopic',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')

Overall, this looks quite promising. Both are in the same ballpark and look qualitatively very similar. However, LSI takes much less time to run (on scale of a minute or two vs. several hours even for a fairly small dataset).

Note that we perform the PCA step of LSI using an approximate SVD implemented in irlba which is much much faster when your R installation is linked against a good linear algebra library on the backend (see Dependencies section above).

Comparison to a Jaccard-based method

For the Jaccard-based method, I am using SnapATAC after generating *.snap files using SnapTools for both our own mouse brain dataset and the 10x genomics dataset used above.

Note that in this comparison, to make sure that I am running LSI on the same input, I’ll be following the standard SnapATAC workflow and running LSI on a binary matrix across 5kb windows in the genome.

promoter.df = read.table("data_downloads/promoters.mm9.bed");
blacklist.df = read.table("data_downloads/encode_blacklist.mm9.bed.gz")

mouse.snap_obj = snapatac_workflow('data_downloads/WholeBrain_PFC.snap',
                                     promoter.df=promoter.df,
                                     blacklist.df = blacklist.df,
                                     window_z_range=c(-1.5, 1.5),
                                     fragment_number_threshold=1000,
                                     promoter_ratio_range=c(0.2, 0.8))

Now in order to more directly compare what LSI and SnapATAC performance, I will extract the final binary window matrix and run our LSI workflow on it.

mouse.seurat.lsi_log.windows = lsi_workflow(t(mouse.snap_obj@bmat), dims=2:50, log_scale_tf = TRUE)

Note that to avoid differences in downstream clustering approaches or other decisions independent of the actual reduction to a PCA space that might make a difference in perceived results, I extract the PCA coordinates from the SnapATAC object and then run the remainder of our downstream workflow on those. Note this will be faster than the above in part due to the fact that we are already starting from the PCA coodinates.

mouse.seurat.snap = seurat_workflow_on_jaccard_pca(mouse.snap_obj, 1:50)

Now plot the results:

plot_clustering_comparison(mouse.seurat.lsi_log.windows,
                           mouse.seurat.snap,
                           reduction='tsne',
                           description1='LSI logTF',
                           description2='SnapATAC',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')

10X Adult Mouse Brain Dataset

To see if the above also holds true on other datasets, I also looked at the adult mouse brain scATAC data recently posted by 10X.

First, read in the data:

tenx_mouse.bmat = load_tenx_atac('data_downloads/filtered_peak_bc_matrix/matrix.mtx', 'data_downloads/filtered_peak_bc_matrix/peaks.bed', 'data_downloads/filtered_peak_bc_matrix/barcodes.tsv')
tenx_mouse.bmat = filter_features(tenx_mouse.bmat, cells=50)

# Binarize the matrix for consistency
tenx_mouse.bmat@x[tenx_mouse.bmat@x > 1] = 1

Run both flavors of LSI. Note that this is a smaller dataset and I chose a higher resolution parameter than above for clustering here for the purposes of some of the comparisons in the blog. In all cases, I will consistently use the same resolution parameter when comparing results using two methods on the same dataset.

tenx_mouse.seurat.lsi = lsi_workflow(tenx_mouse.bmat, dims=2:50, log_scale_tf=FALSE, reduction='pca', resolution=1.5)
tenx_mouse.seurat.lsi_log = lsi_workflow(tenx_mouse.bmat, dims=2:50, log_scale_tf=TRUE, reduction='pca.l2', resolution=1.5)

Now plot the results:

plot_clustering_comparison(tenx_mouse.seurat.lsi,
                           tenx_mouse.seurat.lsi_log,
                           reduction='tsne',
                           description1='Published LSI',
                           description2='LSI logTF',
                           cluster_column1='RNA_snn_res.1.5',
                           cluster_column2='RNA_snn_res.1.5')

Again, the version of LSI with log scaling seems to provide substantially better resolution.

Comparison to cisTopic

Read in data same as above and run our workflow on cisTopic cell by topic matrix (again, you can run the actual command rather than reading in precomputed results if you’d like):

#tenx_mouse.cistopic = cistopic_workflow(tenx_mouse.bmat)
tenx_mouse.cistopic = readRDS('data_downloads/atac_v1_adult_brain_fresh_5k.cistopic.rds')
tenx_mouse.seurat.cistopic = seurat_workflow_on_cistopic(tenx_mouse.cistopic, resolution=1.5, method='Z-score')

Now plot the results:

plot_clustering_comparison(tenx_mouse.seurat.lsi_log,
                           tenx_mouse.seurat.cistopic,
                           reduction='tsne',
                           description1='LSI logTF',
                           description2='cisTopic',
                           cluster_column1='RNA_snn_res.1.5',
                           cluster_column2='RNA_snn_res.1.5')

Comparison to a Jaccard-based method

Comparing again to SnapATAC:

promoter.df = read.delim("data_downloads/promoters.mm10.bed", sep="\t")
blacklist.df = read.delim("data_downloads/encode_blacklist.mm10.bed.gz", sep="\t")

tenx_mouse.snap_obj = snapatac_workflow('data_downloads/atac_v1_adult_brain_fresh_5k.snap',
                                     promoter.df=promoter.df,
                                     blacklist.df = blacklist.df,
                                     window_z_range=c(-1.5, 1.5),
                                     fragment_number_threshold=500,
                                     promoter_ratio_range=c(0.2, 0.8))

LSI logTF on windows:

tenx_mouse.seurat.lsi_log.windows = lsi_workflow(t(tenx_mouse.snap_obj@bmat), dims=2:50, log_scale_tf = TRUE)

Remainder of our workflow on Jaccard-derived PCA space:

tenx_mouse.seurat.snap = seurat_workflow_on_jaccard_pca(tenx_mouse.snap_obj, 1:50)

Now plot the results:

plot_clustering_comparison(tenx_mouse.seurat.lsi_log.windows,
                           tenx_mouse.seurat.snap,
                           reduction='tsne',
                           description1='LSI logTF',
                           description2='SnapATAC',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')

Windows vs. Peaks

In the blog post, I discuss the pros and cons of using peaks vs. windows for dimensionality reduction. Here, I compare the results we would get with LSI when using either peaks or windows. Note that here I am redoing the peak-based LSI on each dataset such that the cells used match up with the window matrices, which where generated via SnapTools and therefore contain a slightly different subset of cell barcodes.

Our mouse brain dataset

Get set of cells that overlap in both peak and window matrices (generated via different tools, so not exactly overlapping). Then run LSI on each:

overlaps = intersect(colnames(mouse.bmat), mouse.snap_obj@barcode)
mouse.seurat.lsi_log.peaks_comparison = lsi_workflow(mouse.bmat[, colnames(mouse.bmat[, overlaps])], dims=2:50, log_scale_tf = TRUE)

# we pretty much already ran this, but the set of cells will be slightly different from before, just making sure we're all equal here
mouse.seurat.lsi_log.windows_comparison = lsi_workflow(t(mouse.snap_obj@bmat)[, rownames(mouse.snap_obj@bmat[overlaps, ])], dims=2:50, log_scale_tf = TRUE)

Plot for comparison:

plot_clustering_comparison(mouse.seurat.lsi_log.peaks_comparison,
                           mouse.seurat.lsi_log.windows_comparison,
                           reduction='tsne',
                           description1='LSI logTF peaks',
                           description2='LSI logTF windows',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')

Very similar results using peaks/windows at least in this context.

10x genomics dataset

Same but on the 10X genomics mouse brain dataset:

overlaps = intersect(colnames(tenx_mouse.bmat), paste0(tenx_mouse.snap_obj@barcode, '-1')) # SnapATAC removes the -1
overlaps_no_dash = stringr::str_replace(overlaps, '-1', '') # SnapATAC removes the -1
tenx_mouse.seurat.lsi_log.peaks_comparison = lsi_workflow(tenx_mouse.bmat[, colnames(tenx_mouse.bmat[, overlaps])], dims=2:50, log_scale_tf = TRUE)

# we pretty much already ran this, but the set of cells will be slightly different from before, just making sure we're all equal here
tenx_mouse.seurat.lsi_log.windows_comparison = lsi_workflow(t(tenx_mouse.snap_obj@bmat)[, rownames(tenx_mouse.snap_obj@bmat[overlaps_no_dash, ])], dims=2:50, log_scale_tf = TRUE)

Plot for comparison:

plot_clustering_comparison(tenx_mouse.seurat.lsi_log.peaks_comparison,
                           tenx_mouse.seurat.lsi_log.windows_comparison,
                           reduction='tsne',
                           description1='LSI logTF peaks',
                           description2='LSI logTF windows',
                           cluster_column1='RNA_snn_res.0.3',
                           cluster_column2='RNA_snn_res.0.3')

Again, very similar results using peaks/windows at this level of clustering. Would need to look at this further to understand if either has a clear performance advantage.

Downsampled data

If we purely want to assess the impact of overall complexity on dimensionality reduction, we can just choose a subset of windows to zero out at random.

It is worth noting that this by no means captures the real world variation in data quality for scATAC-seq data. Signal to noise ratio (e.g. as measured by FRiP, promoter enrichment, etc.) is also very important and would not be captured by this simple test. By this I just mean to say that datasets can look very different even at the same overall complexity due to a number of other factors.

Here I chose to use the 10x mouse dataset rather than running on both of the datasets above, but in principle you could run either using similar commands.

Try downsampling to 20% and 50% of the original depth:

set.seed(2019)

tenx_mouse.window_bmat.downsampled_20 = downsample_matrix(t(tenx_mouse.snap_obj@bmat), fraction_remaining=0.2, cells_per_site_min=20, sites_per_cell_min=1)
tenx_mouse.window_bmat.downsampled_50 = downsample_matrix(t(tenx_mouse.snap_obj@bmat), fraction_remaining=0.5, cells_per_site_min=20, sites_per_cell_min=1)

median(Matrix::colSums(tenx_mouse.window_bmat.downsampled_20))
median(Matrix::colSums(tenx_mouse.window_bmat.downsampled_50))

Run both versions of LSI and Jaccard method on each downsampled dataset:

tenx_mouse.seurat.lsi.downsampled_20 = lsi_workflow(tenx_mouse.window_bmat.downsampled_20, dims=2:50, log_scale_tf=FALSE, reduction='pca')
tenx_mouse.seurat.lsi_log.downsampled_20 = lsi_workflow(tenx_mouse.window_bmat.downsampled_20, dims=2:50, log_scale_tf=TRUE, reduction='pca.l2')

tenx_mouse.seurat.lsi.downsampled_50 = lsi_workflow(tenx_mouse.window_bmat.downsampled_50, dims=2:50, log_scale_tf=FALSE, reduction='pca')
tenx_mouse.seurat.lsi_log.downsampled_50 = lsi_workflow(tenx_mouse.window_bmat.downsampled_50, dims=2:50, log_scale_tf=TRUE, reduction='pca.l2')

# We also never ran the published LSI method on windows, so do that too for comparison
tenx_mouse.seurat.lsi.windows = lsi_workflow(t(tenx_mouse.snap_obj@bmat), dims=2:50, log_scale_tf = FALSE)

# Also set up SnapATAC using these downsampled matrices
windows_20_idx = which(rownames(tenx_mouse.window_bmat.downsampled_20) %in% rownames(t(tenx_mouse.snap_obj@bmat)))
windows_50_idx = which(rownames(tenx_mouse.window_bmat.downsampled_50) %in% rownames(t(tenx_mouse.snap_obj@bmat)))
tenx_mouse.snap_obj.downsampled_20 = tenx_mouse.snap_obj[,windows_20_idx, mat='bmat']
tenx_mouse.snap_obj.downsampled_50 = tenx_mouse.snap_obj[,windows_50_idx, mat='bmat']
tenx_mouse.snap_obj.downsampled_20@bmat = t(tenx_mouse.window_bmat.downsampled_20)
tenx_mouse.snap_obj.downsampled_50@bmat = t(tenx_mouse.window_bmat.downsampled_50)

# Rerun SnapATAC Jaccard method on downsampled matrices
tenx_mouse.snap_obj.downsampled_20 = snapatac_rerun_jaccard(tenx_mouse.snap_obj.downsampled_20)
tenx_mouse.seurat.snap.downsampled_20 = seurat_workflow_on_jaccard_pca(tenx_mouse.snap_obj.downsampled_20, 1:50)
tenx_mouse.snap_obj.downsampled_50 = snapatac_rerun_jaccard(tenx_mouse.snap_obj.downsampled_50)
tenx_mouse.seurat.snap.downsampled_50 = seurat_workflow_on_jaccard_pca(tenx_mouse.snap_obj.downsampled_50, 1:50)

Compare both methods at progressively higher downsampling rates, always coloring by original LSI logTF clusters

# Put original clusters in all
tenx_mouse.seurat.lsi.downsampled_20$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.lsi_log.downsampled_20$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.lsi.downsampled_50$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.lsi_log.downsampled_50$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.lsi.windows$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.snap.downsampled_20$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.snap.downsampled_50$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3
tenx_mouse.seurat.snap$original = tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3

tenx_mouse.seurat.lsi.downsampled_20 = SetIdent(tenx_mouse.seurat.lsi.downsampled_20, value='original')
tenx_mouse.seurat.lsi_log.downsampled_20 = SetIdent(tenx_mouse.seurat.lsi_log.downsampled_20, value='original')
tenx_mouse.seurat.lsi.downsampled_50 = SetIdent(tenx_mouse.seurat.lsi.downsampled_50, value='original')
tenx_mouse.seurat.lsi_log.downsampled_50 = SetIdent(tenx_mouse.seurat.lsi_log.downsampled_50, value='original')
tenx_mouse.seurat.lsi.windows = SetIdent(tenx_mouse.seurat.lsi.windows, value='original')
tenx_mouse.seurat.lsi_log.windows = SetIdent(tenx_mouse.seurat.lsi_log.windows, value='RNA_snn_res.0.3')

tenx_mouse.seurat.snap.downsampled_20 = SetIdent(tenx_mouse.seurat.snap.downsampled_20, value='original')
tenx_mouse.seurat.snap.downsampled_50 = SetIdent(tenx_mouse.seurat.snap.downsampled_50, value='original')
tenx_mouse.seurat.snap = SetIdent(tenx_mouse.seurat.snap, value='original')

p1 = DimPlot(tenx_mouse.seurat.lsi.downsampled_20, reduction='tsne', pt.size=0.25) + ggtitle('Published LSI (20%)')
p2 = DimPlot(tenx_mouse.seurat.lsi_log.downsampled_20, reduction='tsne', pt.size=0.25) + ggtitle('LSI logTF (20%)')
p3 = DimPlot(tenx_mouse.seurat.snap.downsampled_20, reduction='tsne', pt.size=0.25) + ggtitle('SnapATAC (20%)')
p4 = DimPlot(tenx_mouse.seurat.lsi.downsampled_50, reduction='tsne', pt.size=0.25) + ggtitle('Published LSI (50%)')
p5 = DimPlot(tenx_mouse.seurat.lsi_log.downsampled_50, reduction='tsne', pt.size=0.25) + ggtitle('LSI logTF (50%)')
p6 = DimPlot(tenx_mouse.seurat.snap.downsampled_50, reduction='tsne', pt.size=0.25) + ggtitle('SnapATAC (50%)')
p7 = DimPlot(tenx_mouse.seurat.lsi.windows, reduction='tsne', pt.size=0.25) + ggtitle('Published LSI (100%)')
p8 = DimPlot(tenx_mouse.seurat.lsi_log.windows, reduction='tsne', pt.size=0.25) + ggtitle('LSI logTF (100%)')
p9 = DimPlot(tenx_mouse.seurat.snap, reduction='tsne', pt.size=0.25) + ggtitle('SnapATAC (100%)')

(p1 + p4 + p7) / (p2 + p5 + p8) / (p3 + p6 + p9)

Both SnapATAC and our logTF version of LSI seem to perform quite comarably at reduced depth. While the plots above look very consistent, there is a slight decrease in the number of clusters called by Seurat for all approaches with downsampling using a consistent resolution parameter (although using a higher resolution paramter would probably enable similar groupings, which is great).

print('Published LSI')
max(as.numeric(tenx_mouse.seurat.lsi.downsampled_20@meta.data$RNA_snn_res.0.3))
max(as.numeric(tenx_mouse.seurat.lsi.downsampled_50@meta.data$RNA_snn_res.0.3))
max(as.numeric(tenx_mouse.seurat.lsi.windows@meta.data$RNA_snn_res.0.3))

print('---------')
print('LSI logTF')
max(as.numeric(tenx_mouse.seurat.lsi_log.downsampled_20@meta.data$RNA_snn_res.0.3))
max(as.numeric(tenx_mouse.seurat.lsi_log.downsampled_50@meta.data$RNA_snn_res.0.3))
max(as.numeric(tenx_mouse.seurat.lsi_log.windows@meta.data$RNA_snn_res.0.3))

print('---------')
print('SnapATAC')
max(as.numeric(tenx_mouse.seurat.snap.downsampled_20@meta.data$RNA_snn_res.0.3))
max(as.numeric(tenx_mouse.seurat.snap.downsampled_50@meta.data$RNA_snn_res.0.3))
max(as.numeric(tenx_mouse.seurat.snap@meta.data$RNA_snn_res.0.3))
## [1] "Published LSI"
## [1] 4
## [1] 7
## [1] 8
## [1] "---------"
## [1] "LSI logTF"
## [1] 11
## [1] 16
## [1] 16
## [1] "---------"
## [1] "SnapATAC"
## [1] 13
## [1] 15
## [1] 15

For logTF LSI there are 16, 16, and 11 clusters called at this resolution with decreasing depth (8, 7, and 4 without log scaling). For SnapATAC the number of clusters called is 15, 15, and 13. Therefore, it seems like at least the last downsampling has some deterimental effect in this case for all techniques although I can’t rule out the possibility that choosing slightly different site-level filtering strategies might make a difference here or that use of cluster counts is misleading in some way.

It’s possible that one of the two might perform better in an iterative context or other contexts in which there are less obvious differences between cells. In poking around so far, I havne’t found a case where one clearly outperforms the other, but I plan to continue to compare the two on other datasets and would be very interested if someone does find some good examples.

It would also be quite interesting to simulate the impact of decreased FRiP independent of complexity (downsampling at a given rate but changing the ratio of downsampling in and out of peaks/promoters).